-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for DISTINCT + ORDER BY in ARRAY_AGG #14413
base: main
Are you sure you want to change the base?
Conversation
9400d72
to
8eaacd6
Compare
8eaacd6
to
f23681c
Compare
@@ -193,3 +193,149 @@ pub fn merge_ordered_arrays( | |||
|
|||
Ok((merged_values, merged_orderings)) | |||
} | |||
|
|||
#[cfg(test)] | |||
mod tests { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved this tests from
mod tests { |
merge_ordered_arrays
function present in this file, and nothing related to ARRAY_AGG
.
As I added there some unit tests that do test ARRAY_AGG
, I though that it might be a good idea to move these ones out to a more suitable place.
if values.len() != 1 { | ||
return internal_err!("expects single batch"); | ||
if values.is_empty() { | ||
return Ok(()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As now the distinct accumulator can accept more than 1 batch because of the ordering, removing this restriction was necessary.
Close/reopen to rerun CI |
f23681c
to
0d53376
Compare
// ARRAY_AGG(DISTINCT col ORDER BY other_col) <- Invalid | ||
// ARRAY_AGG(DISTINCT col ORDER BY concat(col, '')) <- Invalid | ||
if acc_args.ordering_req.len() > 1 { | ||
return exec_err!("In an aggregate with DISTINCT, ORDER BY expressions must appear in argument list"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a copy-paste typo of the error message? Shouldn't it rather be something more like:
return exec_err!("In an aggregate with DISTINCT, only one ORDER BY expression is allowed");
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The restriction goes further than that: all the expressions used in the ORDER BY must appear in as arguments of the aggregation function. In this case, ARRAY_AGG accepts one argument, therefore only that 1 expression is also valid as an ORDER BY expression.
I took the error message from Postgres, but maybe it's better to customize it to the ARRAY_AGG function.
let mut sort_option: Option<SortOptions> = None; | ||
if let Some(order) = acc_args.ordering_req.first() { | ||
if !order.expr.eq(&acc_args.exprs[0]) { | ||
return exec_err!("In an aggregate with DISTINCT, ORDER BY expressions must appear in argument list"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since there can be only one ORDER BY
expression, the error message should refer to it in singular. I'd suggest:
return exec_err!("In an aggregate with DISTINCT, the ORDER BY expression must match the aggregate argument");
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since there can be only one ORDER BY expression, the error message should refer to it in singular
This was pretty much the error message that Postgres would return, but they have this limitation as a general rule for all aggregation functions, which does not seem to be the case for DataFusion. The other option would be to fully customize the error message for the specific ARRAY_AGG case, which will result in a less standard but more informative error message. Any opinions here @alamb ?
@@ -131,7 +133,32 @@ impl AggregateUDFImpl for ArrayAgg { | |||
let data_type = acc_args.exprs[0].data_type(acc_args.schema)?; | |||
|
|||
if acc_args.is_distinct { | |||
return Ok(Box::new(DistinctArrayAggAccumulator::try_new(&data_type)?)); | |||
// Limitation similar to Postgres. The aggregation function can only mix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doc comment is great! Would it be possible to make it more user-facing? Maybe as a rustdoc over the ArrayAgg
structure? As well as in the ArrayAgg
documentation if it exists?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to look for inspiration in other aggregation function docs, and they seem overly brief on purpose, no other doc mentions anything about DISTINCT clause behavior or anything like that. Maybe someone else can chime in and throw some light about what's the best approach here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice @gabotechs
I suggests to add a few more corner and negative tests
0d53376
to
dbc8a75
Compare
Which issue does this PR close?
Rationale for this change
Completing ARRAY_AGG functionality as a prerequisite for adding the full functionality of STRING_AGG in #14412
What changes are included in this PR?
Adds a Postgres-style support for DISTINCT + ORDER_BY functionality, allowing users to issue statements like:
Note that there's a limitation that prohibits ordering by an expression that is not the same as the ARRAY_AGG argument. For example, the following queries are invalid:
This is the same limitation that exists on Postgres, example in Postgres fiddle
Are these changes tested?
yes, both in unit tests and sqllogictests
Are there any user-facing changes?
Users will now be able to issue ARRAY_AGG calls mixing DISTINCT and ORDER_BY clauses